Building Transactional Frequency Models
In this workbook we investigate different ways to model the transaction frequency of individual customers, with a view to expanding this approach into a more traditional P/NBD model.
1 Load and Configure Datasets
We first want to load some synthetic transaction data.
customer_cohort_tbl <- read_rds("data/synthdata_singleyear_cohort_tbl.rds")
customer_cohort_tbl %>% glimpse()## Rows: 50,000
## Columns: 4
## $ customer_id <chr> "8JRWQHAXGE", "8D1JREK9KR", "OXUF71I25Q", "1SCZ4QOQ4Y",…
## $ cohort_qtr <chr> "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", …
## $ cohort_ym <chr> "2019 01", "2019 01", "2019 01", "2019 01", "2019 01", …
## $ first_tnx_date <date> 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-0…
customer_simparams_tbl <- read_rds("data/synthdata_singleyear_simparams_tbl.rds")
customer_simparams_tbl %>% glimpse()## Rows: 50,000
## Columns: 8
## $ customer_id <chr> "000JSNK3XH", "001A7TLTYV", "004GMUBSCD", "004LAX8P44"…
## $ cohort_qtr <chr> "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1",…
## $ cohort_ym <chr> "2019 01", "2019 01", "2019 01", "2019 01", "2019 01",…
## $ first_tnx_date <date> 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-…
## $ customer_tau <dbl> 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53…
## $ customer_lambda <dbl> 0.63746283, 0.05247298, 0.06824901, 0.20599461, 1.1217…
## $ customer_nu <dbl> 0.449753057, 0.456665157, 0.263130590, 0.007679448, 1.…
## $ customer_p <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,…
customer_transactions_tbl <- read_rds("data/synthdata_singleyear_transactions_tbl.rds")
customer_transactions_tbl %>% glimpse()## Rows: 919,470
## Columns: 5
## $ customer_id <chr> "000JSNK3XH", "000JSNK3XH", "000JSNK3XH", "000JSNK3XH", …
## $ cohort_qtr <chr> "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "…
## $ cohort_ym <chr> "2019 01", "2019 01", "2019 01", "2019 01", "2019 01", "…
## $ tnx_timestamp <dttm> 2019-01-01 19:52:38, 2019-01-16 20:04:48, 2019-01-23 23…
## $ tnx_amount <dbl> 234.83, 213.38, 214.90, 180.31, 234.99, 245.72, 226.07, …
We also want to set up a number of parameters for use in this workbook
stan_modeldir <- "stan_models"
stan_codedir <- "stan_code"1.1 Construct Frequency Data
customer_summarystats_tbl <- customer_transactions_tbl %>%
calculate_transaction_summary_stats()
customer_summarystats_tbl %>% glimpse()## Rows: 50,000
## Columns: 9
## $ customer_id <chr> "000JSNK3XH", "001A7TLTYV", "004GMUBSCD", "004LAX8P44", "…
## $ tnx_count <int> 29, 3, 2, 16, 60, 9, 42, 2, 7, 132, 27, 2, 10, 3, 13, 5, …
## $ first_tnx_ts <dttm> 2019-01-01 19:52:38, 2019-01-01 01:53:18, 2019-01-01 18:…
## $ last_tnx_ts <dttm> 2019-11-19 10:48:17, 2019-12-25 16:49:55, 2019-08-20 05:…
## $ btyd_count <dbl> 28, 2, 1, 15, 59, 8, 41, 1, 6, 131, 26, 1, 9, 2, 12, 4, 1…
## $ all_weeks <dbl> 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 5…
## $ tnx_weeks <dbl> 45.94600, 51.23181, 32.91996, 50.42715, 51.90082, 45.4073…
## $ obs_freq <dbl> 0.60941109, 0.03903825, 0.03037671, 0.29745883, 1.1367836…
## $ emp_freq <dbl> 0.53846154, 0.03846154, 0.01923077, 0.28846154, 1.1346153…
We want to view the transactions to get a sense of how regular our transactions are in general.
plot_tbl <- customer_transactions_tbl %>%
group_nest(customer_id, .key = "cust_data") %>%
filter(map_int(cust_data, nrow) > 3) %>%
slice_sample(n = 30) %>%
unnest(cust_data)
ggplot(plot_tbl, aes(x = tnx_timestamp, y = customer_id)) +
geom_line() +
geom_point() +
labs(
x = "Date",
y = "Customer ID",
title = "Visualisation of Customer Transaction Times"
) +
theme(axis.text.y = element_text(size = 10))1.2 Construct Data Subsets
For the purposes of saving on time and computation, we construct a subset of
the data first, and rather than constructing different samples each time we
instead construct sets of customer_id to draw upon, then using those list
of values when we want to take a subset.
We do this by randomly shuffling the data and then selecting the top \(n\) values
of customer_id.
shuffle_tbl <- customer_summarystats_tbl %>%
slice_sample(n = nrow(.), replace = FALSE)
id_1000 <- shuffle_tbl %>% head(1000) %>% pull(customer_id) %>% sort()
id_5000 <- shuffle_tbl %>% head(5000) %>% pull(customer_id) %>% sort()
id_10000 <- shuffle_tbl %>% head(10000) %>% pull(customer_id) %>% sort()We now have a list of customer_id values we use to subset the data.
2 Use of Conjugate Priors
Some combinations of likelihood model and prior distribution combine in a convenient way.
In particular, if this combination integrates to a posterior distribution that has the same distribution family as the prior, the prior is said to be a conjugate prior for that likelihood.
The major benefit of this situation is that the integration step can be skipped entirely: the result is that the parameters of conjugate prior are combined with summary statistics of the observed data to get our posterior distribution.
Most commonly-used likelihood models have conjugate priors, and the Poisson distribution has the Gamma as a conjugate prior.
In particular, suppose we have a Gamma distribution prior on \(\lambda\):
\[ \lambda_{prior} \sim \Gamma(r, \alpha) \] and we observe \(x\) events over a period \(n\), our posterior distribution for \(\lambda\) is now.
\[ \lambda_{post} \sim \Gamma(r + x, \alpha + n) \]
2.1 Statistical Inference on Frequency Rates
Suppose we have new customer join our website and observe a number of transactions from that customer over the following year. For now, we ignore churn and assume the customer was an active customer for that whole year.
How do we combine our prior knowledge and observed transactions to perform inference on that customers transaction rate?
From the above section, we see that calculating the posterior distribution for \(\lambda\) is simple, so we now investigate the consequences of this by looking at how different sets of observations and priors combine.
2.2 Initial Examples
In terms of our prior distribution, we set our prior based on average expectations of the population of interest. Suppose the average customer transacts on our site once a month, or once every four weeks.
Since we are using weekly time periods as our reference point, this means that if we were calculating a point estimate it would be 0.25.
For the purposes of Bayesian analysis though, this is insufficient - we need a distribution for \(\lambda\) that captures our prior knowledge.
Narrowing down to a Gamma distribution, the mean for our Gamma is
\[ \bar{x} = \frac{r}{\alpha}. \]
So now we look at a few different parameter combinations with this mean.
x_vals <- seq(0.01, 1, by = 0.001)
y1_vals <- dgamma(x_vals, shape = 0.1, rate = 0.4)
y2_vals <- dgamma(x_vals, shape = 1.0, rate = 4.0)
y3_vals <- dgamma(x_vals, shape = 2.0, rate = 8.0)
y4_vals <- dgamma(x_vals, shape = 10.0, rate = 40.0)
plot_tbl <- tibble(
x = x_vals,
`Gamma(0.5, 0.8)` = y1_vals,
`Gamma(1, 4)` = y2_vals,
`Gamma(2, 8)` = y3_vals,
`Gamma(5, 25)` = y4_vals
) %>%
pivot_longer(
!x,
names_to = "label",
values_to = "dens"
)
ggplot(plot_tbl) +
geom_line(aes(x = x, y = dens, colour = label)) +
labs(
x = "Value",
y = "Density",
title = "Comparison Density Plot for Gamma"
)So we now look at the effect of priors when combined with observed data.
Suppose over the period of a year a customer transacts 13 times in the year - this is fully consistent with the prior mean.
x_vals <- seq(0.01, 1, by = 0.001)
prior_1_4_dens <- dgamma(x_vals, shape = 1, rate = 4)
post_1_4_dens <- dgamma(x_vals, shape = 1 + 13, rate = 4 + 52)
ggplot() +
geom_line(aes(x = x_vals, y = prior_1_4_dens), colour = "black") +
geom_line(aes(x = x_vals, y = post_1_4_dens), colour = "red") +
labs(
x = "Value",
y = "Density",
title = "Prior and Posterior Comparison with Gamma(1, 4) Prior"
)We now construct a similar comparison with a stronger prior on \(\lambda\).
x_vals <- seq(0.01, 1, by = 0.001)
prior_5_20_dens <- dgamma(x_vals, shape = 5, rate = 20)
post_5_20_dens <- dgamma(x_vals, shape = 5 + 13, rate = 20 + 52)
ggplot() +
geom_line(aes(x = x_vals, y = prior_5_20_dens), colour = "black") +
geom_line(aes(x = x_vals, y = post_5_20_dens), colour = "red") +
labs(
x = "Value",
y = "Density",
title = "Prior and Posterior Comparison with Gamma(5, 20) Prior"
)It might be more useful to compare the two posteriors directly - both are based on the same observed data.
ggplot() +
geom_line(aes(x = x_vals, y = post_1_4_dens), colour = "black") +
geom_line(aes(x = x_vals, y = post_5_20_dens), colour = "red") +
labs(
x = "Value",
y = "Density",
title = "Comparison of Posterior Densities Based on 14 Observed Events"
)2.3 Prior and Data Mismatch
It is worth exploring how the prior and data interact when the disagree.
We use a few Gamma priors with a mean value of 0.5 and the same data (13 events in 52 weeks) and see what the effect is.
x_vals <- seq(0.01, 1, by = 0.001)
prior_1_2_dens <- dgamma(x_vals, shape = 1, rate = 2)
post_1_2_dens <- dgamma(x_vals, shape = 1 + 13, rate = 2 + 52)
ggplot() +
geom_line(aes(x = x_vals, y = prior_1_4_dens), colour = "black") +
geom_line(aes(x = x_vals, y = post_1_4_dens), colour = "red") +
labs(
x = "Value",
y = "Density",
title = "Prior and Posterior Comparison with Gamma(1, 2) Prior"
)We try a tighter prior around the value (though the weighting is still in favour of the data)
x_vals <- seq(0.01, 1, by = 0.001)
prior_5_10_dens <- dgamma(x_vals, shape = 5, rate = 10)
post_5_10_dens <- dgamma(x_vals, shape = 5 + 13, rate = 10 + 52)
ggplot() +
geom_line(aes(x = x_vals, y = prior_5_10_dens), colour = "black") +
geom_line(aes(x = x_vals, y = post_5_10_dens), colour = "red") +
labs(
x = "Value",
y = "Density",
title = "Prior and Posterior Comparison with Gamma(5, 10) Prior"
)We also want to try this with a prior that is “stronger” than the data.
x_vals <- seq(0.01, 1, by = 0.001)
prior_50_100_dens <- dgamma(x_vals, shape = 50, rate = 100)
post_50_100_dens <- dgamma(x_vals, shape = 50 + 13, rate = 100 + 52)
ggplot() +
geom_line(aes(x = x_vals, y = prior_50_100_dens), colour = "black") +
geom_line(aes(x = x_vals, y = post_50_100_dens), colour = "red") +
labs(
x = "Value",
y = "Density",
title = "Prior and Posterior Comparison with Gamma(50, 100) Prior"
)We finally want to check what happens with a very weak prior in the wrong location.
x_vals <- seq(0.01, 1, by = 0.001)
prior_1_1_dens <- dgamma(x_vals, shape = 1, rate = 1)
post_1_1_dens <- dgamma(x_vals, shape = 1 + 13, rate = 1 + 52)
ggplot() +
geom_line(aes(x = x_vals, y = prior_1_1_dens), colour = "black") +
geom_line(aes(x = x_vals, y = post_1_1_dens), colour = "red") +
labs(
x = "Value",
y = "Density",
title = "Prior and Posterior Comparison with Gamma(1, 1) Prior"
)3 Construct Flat Frequency Model
For first frequency model we fit each customer’s transaction frequency independent of the other customers. We start with the estimation that the average customer frequency is about one transaction every four weeks (once a month) so this value determines our initial prior distribution for the transaction frequency.
For our initial fit, we use only 1,000 eligible customers to reduce computation time and construct a dataset used to fit these models.
fit_1000_data_tbl <- customer_summarystats_tbl %>%
filter(
customer_id %in% id_1000
)
fit_1000_data_tbl %>% glimpse()## Rows: 1,000
## Columns: 9
## $ customer_id <chr> "00RC3IE4XG", "02NO5FZ0VS", "02RE0JNCFQ", "04COTHYBKJ", "…
## $ tnx_count <int> 4, 2, 53, 19, 1, 4, 27, 16, 25, 21, 16, 46, 16, 2, 8, 5, …
## $ first_tnx_ts <dttm> 2019-01-01 08:59:28, 2019-01-01 06:23:37, 2019-01-01 04:…
## $ last_tnx_ts <dttm> 2019-07-30 12:39:28, 2019-02-18 02:35:37, 2019-12-15 16:…
## $ btyd_count <dbl> 3, 1, 52, 18, 0, 3, 26, 15, 24, 20, 15, 45, 15, 1, 7, 4, …
## $ all_weeks <dbl> 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 5…
## $ tnx_weeks <dbl> 30.021826, 6.834524, 49.786903, 51.422187, 52.000000, 45.…
## $ obs_freq <dbl> 0.09992730, 0.14631596, 1.04445139, 0.35004346, 0.0000000…
## $ emp_freq <dbl> 0.05769231, 0.01923077, 1.00000000, 0.34615385, 0.0000000…
We do the same for the 10k customers.
fit_10000_data_tbl <- customer_summarystats_tbl %>%
filter(
customer_id %in% id_10000
)
fit_10000_data_tbl %>% glimpse()## Rows: 10,000
## Columns: 9
## $ customer_id <chr> "000JSNK3XH", "004LAX8P44", "0068WIE4X2", "00O7MNGPZC", "…
## $ tnx_count <int> 29, 16, 9, 43, 11, 4, 6, 4, 1, 7, 26, 7, 2, 9, 9, 11, 11,…
## $ first_tnx_ts <dttm> 2019-01-01 19:52:38, 2019-01-01 23:38:35, 2019-01-01 12:…
## $ last_tnx_ts <dttm> 2019-11-19 10:48:17, 2019-12-20 23:24:14, 2019-11-15 09:…
## $ btyd_count <dbl> 28, 15, 8, 42, 10, 3, 5, 3, 0, 6, 25, 6, 1, 8, 8, 10, 10,…
## $ all_weeks <dbl> 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 5…
## $ tnx_weeks <dbl> 45.945997, 50.427147, 45.407391, 51.975108, 47.378704, 30…
## $ obs_freq <dbl> 0.60941109, 0.29745883, 0.17618277, 0.80807913, 0.2110652…
## $ emp_freq <dbl> 0.53846154, 0.28846154, 0.15384615, 0.80769231, 0.1923076…
3.1 Fit Stan Model
We start by fitting our first Stan model, fitting a Poisson rate for each individual customer.
## data {
## int<lower=1> n; // count of observations
##
## int<lower=0> btyd_count[n]; // observed number of transactions
## vector<lower=0>[n] tnx_weeks; // observed time-period of transactions
## }
##
## parameters {
## vector<lower=0>[n] lambda;
## }
##
## model {
## target += poisson_lpmf(btyd_count | lambda .* tnx_weeks);
## }
We now compile this model using CmdStanR.
freqmodel_flat_stanmodel <- cmdstan_model(
"stan_code/freqmodel_flat.stan",
include_paths = stan_codedir,
pedantic = TRUE,
dir = stan_modeldir
)We then use this compiled model with our data to produce a fit of the data.
stan_modelname <- "freqmodel_flat"
stan_data_lst <- fit_1000_data_tbl %>%
select(customer_id, btyd_count, tnx_weeks = all_weeks) %>%
compose_data()
freqmodel_flat_stanfit <- freqmodel_flat_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 421,
output_dir = stan_modeldir,
output_basename = stan_modelname
)## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 4.9 seconds.
## Chain 2 finished in 4.8 seconds.
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 4.9 seconds.
## Chain 4 finished in 5.0 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 4.9 seconds.
## Total execution time: 5.1 seconds.
freqmodel_flat_stanfit$summary()## # A tibble: 1,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -4.19e+3 -4.18e+3 23.3 23.8 -4.22e+3 -4.15e+3 1.01 738.
## 2 lambda[1] 7.69e-2 7.02e-2 0.0393 0.0360 2.49e-2 1.49e-1 1.00 2018.
## 3 lambda[2] 3.80e-2 3.28e-2 0.0260 0.0241 7.03e-3 8.71e-2 1.00 1752.
## 4 lambda[3] 1.02e+0 1.01e+0 0.142 0.145 7.99e-1 1.25e+0 1.01 2089.
## 5 lambda[4] 3.65e-1 3.60e-1 0.0837 0.0809 2.44e-1 5.09e-1 1.00 2078.
## 6 lambda[5] 1.90e-2 1.35e-2 0.0188 0.0138 9.98e-4 5.51e-2 1.00 1560.
## 7 lambda[6] 7.66e-2 7.01e-2 0.0377 0.0342 2.81e-2 1.46e-1 1.00 2300.
## 8 lambda[7] 5.19e-1 5.09e-1 0.105 0.103 3.62e-1 7.01e-1 1.00 1865.
## 9 lambda[8] 3.08e-1 3.00e-1 0.0788 0.0784 1.91e-1 4.45e-1 1.00 2168.
## 10 lambda[9] 4.79e-1 4.72e-1 0.0968 0.0952 3.30e-1 6.48e-1 0.998 2658.
## # … with 991 more rows, and 1 more variable: ess_tail <dbl>
We first want to check the HMC diagnostics.
freqmodel_flat_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/freqmodel_flat-1.csv, /home/rstudio/workshop/stan_models/freqmodel_flat-2.csv, /home/rstudio/workshop/stan_models/freqmodel_flat-3.csv, /home/rstudio/workshop/stan_models/freqmodel_flat-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
3.2 Check Model Fit
We now need to check the parameters of this fit against the data to see how effective our model is at capturing the data. In this case we have the benefit of knowing the ‘true’ data, and so we compare our model output against the input parameters.
freqmodel_flat_validation_tbl <- freqmodel_flat_stanfit %>%
recover_types(fit_1000_data_tbl) %>%
spread_draws(lambda[customer_id]) %>%
ungroup() %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
select(customer_id, draw_id = .draw, post_lambda = lambda, customer_lambda)
freqmodel_flat_validation_tbl %>% glimpse()## Rows: 2,000,000
## Columns: 4
## $ customer_id <chr> "00RC3IE4XG", "00RC3IE4XG", "00RC3IE4XG", "00RC3IE4XG"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ post_lambda <dbl> 0.0737775, 0.0760951, 0.0786928, 0.0757330, 0.1026330,…
## $ customer_lambda <dbl> 0.05118089, 0.05118089, 0.05118089, 0.05118089, 0.0511…
Having constructed the validation data we now want to check the quantile of each ‘true’ value in the posterior distribution for the parameter. If our model is valid, this distribution will be uniform on \([0, 1]\).
plotvalid_tbl <- freqmodel_flat_validation_tbl %>%
calculate_distribution_qvals(post_lambda, customer_lambda)
ggplot(plotvalid_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = plotvalid_tbl %>% nrow() %>% divide_by(50)), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Posterior Distribution"
)ggplot(plotvalid_tbl) +
geom_point(aes(x = q_val, y = customer_lambda)) +
labs(
x = "Quantile",
y = "Customer Lambda",
title = "Scatterplot of q-Value against Lambda"
)3.3 Fit Larger Model
We now repeat this model with a larger dataset.
stan_modelname <- "freqmodel_flat_10k"
stan_data_lst <- fit_10000_data_tbl %>%
select(customer_id, btyd_count, tnx_weeks = all_weeks) %>%
compose_data()
freqmodel_flat_10k_stanfit <- freqmodel_flat_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 422,
output_dir = stan_modeldir,
output_basename = stan_modelname
)## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 82.5 seconds.
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 83.9 seconds.
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 85.9 seconds.
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 finished in 86.2 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 84.6 seconds.
## Total execution time: 86.4 seconds.
freqmodel_flat_10k_stanfit$summary()## # A tibble: 10,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -4.18e+4 -4.18e+4 72.6 74.6 -4.19e+4 -4.17e+4 1.00 671.
## 2 lambda[1] 5.52e-1 5.46e-1 0.0983 0.100 4.00e-1 7.24e-1 1.00 2373.
## 3 lambda[2] 3.06e-1 3.00e-1 0.0778 0.0779 1.94e-1 4.46e-1 1.01 2315.
## 4 lambda[3] 1.73e-1 1.68e-1 0.0589 0.0586 8.82e-2 2.79e-1 1.01 2635.
## 5 lambda[4] 8.27e-1 8.23e-1 0.129 0.128 6.28e-1 1.05e+0 1.00 2703.
## 6 lambda[5] 2.12e-1 2.05e-1 0.0669 0.0644 1.16e-1 3.36e-1 1.00 3278.
## 7 lambda[6] 7.76e-2 7.10e-2 0.0392 0.0356 2.61e-2 1.50e-1 1.00 2918.
## 8 lambda[7] 1.17e-1 1.11e-1 0.0473 0.0446 5.05e-2 2.06e-1 1.00 3021.
## 9 lambda[8] 7.73e-2 7.17e-2 0.0369 0.0351 2.65e-2 1.45e-1 1.01 3181.
## 10 lambda[9] 1.91e-2 1.34e-2 0.0186 0.0137 1.00e-3 5.61e-2 1.00 2163.
## # … with 9,991 more rows, and 1 more variable: ess_tail <dbl>
We first want to check the HMC diagnostics.
freqmodel_flat_10k_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/freqmodel_flat_10k-1.csv, /home/rstudio/workshop/stan_models/freqmodel_flat_10k-2.csv, /home/rstudio/workshop/stan_models/freqmodel_flat_10k-3.csv, /home/rstudio/workshop/stan_models/freqmodel_flat_10k-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
3.4 Check Larger Model Fit
We now repeat the validation exercise once more, extracting the posterior customer \(\lambda\) values and comparing them to our input values.
freqmodel_flat_10k_validation_tbl <- freqmodel_flat_10k_stanfit %>%
recover_types(fit_10000_data_tbl) %>%
spread_draws(lambda[customer_id]) %>%
ungroup() %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
select(customer_id, draw_id = .draw, post_lambda = lambda, customer_lambda)
freqmodel_flat_10k_validation_tbl %>% glimpse()## Rows: 20,000,000
## Columns: 4
## $ customer_id <chr> "000JSNK3XH", "000JSNK3XH", "000JSNK3XH", "000JSNK3XH"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ post_lambda <dbl> 0.506860, 0.607616, 0.471304, 0.610821, 0.677388, 0.63…
## $ customer_lambda <dbl> 0.6374628, 0.6374628, 0.6374628, 0.6374628, 0.6374628,…
We also construct validation plots for this.
plotvalid_tbl <- freqmodel_flat_10k_validation_tbl %>%
calculate_distribution_qvals(post_lambda, customer_lambda)
ggplot(plotvalid_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = plotvalid_tbl %>% nrow() %>% divide_by(50)), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Posterior Distribution"
)ggplot(plotvalid_tbl) +
geom_point(aes(x = q_val, y = customer_lambda), alpha = 0.2) +
labs(
x = "Quantile",
y = "Customer Lambda",
title = "Scatterplot of q-Value against Lambda"
)3.5 Construct Observed-Time Frequency Model
We now want to update our approach to use the observed duration of the transactions rather than using a constant time period of 52 weeks.
stan_modelname <- "freqmodel_flat_obs"
stan_data_lst <- fit_1000_data_tbl %>%
transmute(customer_id, btyd_count, tnx_weeks) %>%
compose_data()
freqmodel_flat_obs_stanfit <- freqmodel_flat_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 423,
output_dir = stan_modeldir,
output_basename = stan_modelname
)## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 4.8 seconds.
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 finished in 4.9 seconds.
## Chain 3 finished in 4.8 seconds.
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 finished in 4.9 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 4.8 seconds.
## Total execution time: 5.2 seconds.
freqmodel_flat_obs_stanfit$summary()## # A tibble: 1,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -4.02e+3 -4.02e+3 23.7 24.2 -4.06e+3 -3.98e+3 1.01 765.
## 2 lambda[1] 1.33e-1 1.23e-1 0.0653 0.0625 4.48e-2 2.49e-1 1.00 2451.
## 3 lambda[2] 2.88e-1 2.44e-1 0.205 0.182 4.88e-2 6.92e-1 1.00 1615.
## 4 lambda[3] 1.07e+0 1.06e+0 0.143 0.141 8.49e-1 1.31e+0 1.00 1945.
## 5 lambda[4] 3.68e-1 3.58e-1 0.0859 0.0837 2.40e-1 5.16e-1 1.00 2320.
## 6 lambda[5] 1.94e-2 1.39e-2 0.0191 0.0142 8.65e-4 5.64e-2 1.00 1884.
## 7 lambda[6] 8.70e-2 8.04e-2 0.0428 0.0386 2.88e-2 1.65e-1 1.00 1907.
## 8 lambda[7] 5.38e-1 5.33e-1 0.100 0.0980 3.80e-1 7.11e-1 1.00 2244.
## 9 lambda[8] 3.14e-1 3.04e-1 0.0780 0.0741 2.00e-1 4.57e-1 1.00 2475.
## 10 lambda[9] 4.93e-1 4.89e-1 0.0983 0.0965 3.41e-1 6.67e-1 1.00 2707.
## # … with 991 more rows, and 1 more variable: ess_tail <dbl>
Now that we have fit our models, we repeat our validation model much like we did the last time. We expect this model to deviate from our first model however as we have a less-reliable estimate of the observation time for each customer - as it is now calculated as ending at the last observation.
freqmodel_flat_obs_validation_tbl <- freqmodel_flat_obs_stanfit %>%
recover_types(fit_1000_data_tbl) %>%
spread_draws(lambda[customer_id]) %>%
ungroup() %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
select(customer_id, draw_id = .draw, post_lambda = lambda, customer_lambda)
freqmodel_flat_obs_validation_tbl %>% glimpse()## Rows: 2,000,000
## Columns: 4
## $ customer_id <chr> "00RC3IE4XG", "00RC3IE4XG", "00RC3IE4XG", "00RC3IE4XG"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ post_lambda <dbl> 0.1511800, 0.0799502, 0.2983930, 0.0345082, 0.0995931,…
## $ customer_lambda <dbl> 0.05118089, 0.05118089, 0.05118089, 0.05118089, 0.0511…
We also construct validation plots for this.
plotvalid_tbl <- freqmodel_flat_obs_validation_tbl %>%
calculate_distribution_qvals(post_lambda, customer_lambda)
ggplot(plotvalid_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = plotvalid_tbl %>% nrow() %>% divide_by(50)), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Posterior Distribution"
)ggplot(plotvalid_tbl) +
geom_point(aes(x = q_val, y = customer_lambda), alpha = 0.2) +
labs(
x = "Quantile",
y = "Customer Lambda",
title = "Scatterplot of q-Value against Lambda"
)3.6 Construct Posterior Distribution Comparisons
We now combine the outputted posterior distributions and compare them together.
comparison_distrib_tbl <- list(
`Fixed Time` = freqmodel_flat_validation_tbl,
`Observed Time` = freqmodel_flat_obs_validation_tbl
) %>%
bind_rows(.id = "label")
comparison_distrib_tbl %>% glimpse()## Rows: 4,000,000
## Columns: 5
## $ label <chr> "Fixed Time", "Fixed Time", "Fixed Time", "Fixed Time"…
## $ customer_id <chr> "00RC3IE4XG", "00RC3IE4XG", "00RC3IE4XG", "00RC3IE4XG"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ post_lambda <dbl> 0.0737775, 0.0760951, 0.0786928, 0.0757330, 0.1026330,…
## $ customer_lambda <dbl> 0.05118089, 0.05118089, 0.05118089, 0.05118089, 0.0511…
Now that we have this data we can construct a number of visualisations to help us compare these two distributions on a per-customer basis.
comparison_summary_tbl <- comparison_distrib_tbl %>%
group_by(label, customer_id) %>%
summarise(
.groups = "drop",
labels = c("q10", "q25", "q50", "q75", "q90", "mean", "customer_lambda"),
vals = c(quantile(post_lambda, probs = c(0.1, 0.25, 0.50, 0.75, 0.90)),
mean(post_lambda),
unique(customer_lambda))
) %>%
pivot_wider(
names_from = labels,
values_from = vals
)
plot_tbl <- comparison_summary_tbl %>%
group_nest(customer_id) %>%
slice_sample(n = 30) %>%
unnest(data)
ggplot(plot_tbl) +
geom_errorbar(
aes(x = customer_id, ymin = q10, ymax = q90, colour = label),
position = position_dodge(width = 0.75), width = 0, size = 1) +
geom_errorbar(
aes(x = customer_id, ymin = q25, ymax = q75, colour = label),
position = position_dodge(width = 0.75), width = 0, size = 2) +
geom_point(aes(x = customer_id, y = customer_lambda)) +
labs(
x = "Customer ID",
y = "Frequency Rate",
colour = "Distribution"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))As there is a broad selection of different lambda values, we draw a random
selection of customer_id for different ranges of these values.
plotdata_tbl <- comparison_summary_tbl %>%
mutate(
lambda_segment = case_when(
(customer_lambda > 0.00 & customer_lambda <= 0.30) ~ "SEG1",
(customer_lambda > 0.30 & customer_lambda <= 0.60) ~ "SEG2",
(customer_lambda > 0.60 & customer_lambda <= 0.90) ~ "SEG3",
(customer_lambda > 0.90 & customer_lambda <= 1.20) ~ "SEG4",
(customer_lambda > 1.20) ~ "SEG5",
TRUE ~ "ERROR"
)
) %>%
group_nest(lambda_segment) %>%
mutate(
sample_data = map(data, ~ .x %>% group_nest(customer_id) %>% slice_sample(n = 15))
) %>%
select(lambda_segment, sample_data) %>%
unnest(sample_data) %>%
unnest(data)
ggplot(plotdata_tbl %>% mutate(customer_id = str_trunc(customer_id, width = 7))) +
geom_errorbar(
aes(x = customer_id, ymin = q10, ymax = q90, colour = label),
position = position_dodge(width = 0.75), width = 0, size = 1) +
geom_errorbar(
aes(x = customer_id, ymin = q25, ymax = q75, colour = label),
position = position_dodge(width = 0.75), width = 0, size = 2) +
geom_point(
aes(x = customer_id, y = customer_lambda, group = label),
position = position_dodge(width = 0.75)) +
facet_wrap(vars(lambda_segment), scales = "free") +
labs(
x = "Customer ID",
y = "Frequency Rate",
colour = "Distribution"
) +
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5, size = 6),
legend.position = "bottom"
)3.7 Estimate Posterior Interval Coverage
One final validity check is to see how often the ‘true’ lambda value falls
within our posterior intervals. We create a boolean variable for both intervals
(the 50% and the 80%) and then check the proportion of values against the
width.
coverage_stats_tbl <- comparison_summary_tbl %>%
mutate(
up_50 = customer_lambda > q75,
dn_50 = customer_lambda < q25,
up_80 = customer_lambda < q10,
dn_80 = customer_lambda > q90,
cover_50 = (!up_50 & !dn_50),
cover_80 = (!up_80 & !dn_80)
)
coverage_stats_tbl %>%
group_by(label) %>%
summarise(
.groups = "drop",
prop_50 = sum(cover_50) / n(),
prop_80 = sum(cover_80) / n(),
prop_up50 = sum(up_50) / n(),
prop_dn50 = sum(dn_50) / n(),
prop_up80 = sum(up_80) / n(),
prop_dn80 = sum(dn_80) / n()
)## # A tibble: 2 × 7
## label prop_50 prop_80 prop_up50 prop_dn50 prop_up80 prop_dn80
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Fixed Time 0.503 0.805 0.186 0.311 0.129 0.066
## 2 Observed Time 0.438 0.73 0.125 0.437 0.23 0.04
These numbers line up with what we expect, though we can see that using the
observed time as our input introduces an upward bias in inferred values for
lambda. We could assess further using bootstrap techniques, but I do not
think it is necessary.
We will leave repeating the above analysis for the larger datasets as an exercise.
4 Adding Priors to the Model
We now want to add priors to our model to improve our inferences. Our current
model provides no input for our prior knowledge on the values of lambda - but
a bit of thought will show this is not correct. For example, we know that a
value above 10.0 is almost impossible, and most customers will have a lambda
rate between 0 and 1 say, averaging around roughly 0.25.
A good choice for this distribution is the Gamma with a shape parameter of 1 and a ‘rate’ parameter of 4.
We can check what this looks like by simulation:
lambda_vals <- rgamma(10000, shape = 1, rate = 4)
ggplot() +
geom_histogram(aes(x = lambda_vals), bins = 50) +
labs(
x = "Lambda",
y = "Frequency",
title = "Histogram of Lambda Values Simulated from Gamma(1, 4)"
)The above distribution looks reasonable, though perhaps we would want a bit more probability mass at the higher end.
lambda_vals <- rgamma(10000, shape = 0.25, rate = 1.0)
ggplot() +
geom_histogram(aes(x = lambda_vals), bins = 50) +
labs(
x = "Lambda",
y = "Frequency",
title = "Histogram of Lambda Values Simulated from Gamma(0.25, 1)"
)This distribution looks a lot more skewed, so we also plot this histogram on a log-scale.
ggplot() +
geom_histogram(aes(x = lambda_vals), bins = 50) +
scale_x_log10() +
labs(
x = "Lambda",
y = "Frequency",
title = "Histogram of Lambda Values Simulated from Gamma(0.25, 1)"
)4.1 Construct Stan Model with Priors on Lambda
This is ‘prior’ knowledge and we can set up these priors for the parameters within our model to allow for this prior knowledge.
## data {
## int<lower=1> n; // count of observations
##
## int<lower=0> btyd_count[n]; // observed number of transactions
## vector<lower=0>[n] tnx_weeks; // observed time-period of transactions
##
## real r;
## real alpha;
## }
##
## parameters {
## vector<lower=0>[n] lambda;
## }
##
## model {
## lambda ~ gamma(r, alpha);
##
## target += poisson_lpmf(btyd_count | lambda .* tnx_weeks);
## }
You can see we have now added code to say that our posterior inference on
each customer value for lambda is the combination of a prior Gamma
distribution and our Poisson likelihood for the observed counts.
We now compile this model using CmdStanR.
freqmodel_prior_stanmodel <- cmdstan_model(
"stan_code/freqmodel_prior.stan",
include_paths = stan_codedir,
pedantic = TRUE,
dir = stan_modeldir
)Having compiled the Stan model, we now fit it with our data.
stan_modelname <- "freqmodel_prior"
stan_data_lst <- fit_1000_data_tbl %>%
transmute(customer_id, btyd_count, tnx_weeks) %>%
compose_data(
r = 0.25,
alpha = 1
)
freqmodel_prior_stanfit <- freqmodel_prior_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 425,
output_dir = stan_modeldir,
output_basename = stan_modelname
)## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 11.4 seconds.
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 11.8 seconds.
## Chain 4 finished in 11.7 seconds.
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 12.0 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 11.7 seconds.
## Total execution time: 12.1 seconds.
freqmodel_prior_stanfit$summary()## # A tibble: 1,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -3.31e+3 -3.31e+3 2.29e+1 2.21e+1 -3.35e+3 -3.27e+3 1.01 720.
## 2 lambda[1] 1.08e-1 9.70e-2 5.87e-2 5.58e-2 3.23e-2 2.15e-1 1.00 2372.
## 3 lambda[2] 1.57e-1 1.21e-1 1.33e-1 1.12e-1 1.50e-2 4.25e-1 1.00 2029.
## 4 lambda[3] 1.03e+0 1.02e+0 1.39e-1 1.37e-1 8.17e-1 1.27e+0 1.00 2593.
## 5 lambda[4] 3.47e-1 3.41e-1 7.81e-2 7.46e-2 2.28e-1 4.85e-1 1.00 2145.
## 6 lambda[5] 4.59e-3 8.81e-4 8.88e-3 1.30e-3 7.75e-8 2.29e-2 1.00 894.
## 7 lambda[6] 6.87e-2 6.28e-2 3.60e-2 3.54e-2 2.24e-2 1.38e-1 1.00 1869.
## 8 lambda[7] 5.11e-1 5.03e-1 1.02e-1 1.04e-1 3.57e-1 6.91e-1 1.00 1930.
## 9 lambda[8] 2.92e-1 2.85e-1 7.44e-2 7.28e-2 1.82e-1 4.22e-1 1.00 2249.
## 10 lambda[9] 4.70e-1 4.65e-1 9.88e-2 1.00e-1 3.23e-1 6.39e-1 0.999 2397.
## # … with 991 more rows, and 1 more variable: ess_tail <dbl>
Once again, we want to check the HMC diagnostics.
freqmodel_prior_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/freqmodel_prior-1.csv, /home/rstudio/workshop/stan_models/freqmodel_prior-2.csv, /home/rstudio/workshop/stan_models/freqmodel_prior-3.csv, /home/rstudio/workshop/stan_models/freqmodel_prior-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
4.2 Check Posterior Lambda Values
We repeat our validation exercise from before, comparing the outputs of the
posterior distribution for each lambda against the known true value.
freqmodel_prior_validation_tbl <- freqmodel_prior_stanfit %>%
recover_types(fit_1000_data_tbl) %>%
spread_draws(lambda[customer_id]) %>%
ungroup() %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
select(customer_id, draw_id = .draw, post_lambda = lambda, customer_lambda)
freqmodel_prior_validation_tbl %>% glimpse()## Rows: 2,000,000
## Columns: 4
## $ customer_id <chr> "00RC3IE4XG", "00RC3IE4XG", "00RC3IE4XG", "00RC3IE4XG"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ post_lambda <dbl> 0.1013710, 0.0864285, 0.0846721, 0.1456680, 0.0776031,…
## $ customer_lambda <dbl> 0.05118089, 0.05118089, 0.05118089, 0.05118089, 0.0511…
We now calculate the q-values of each parameter against the posterior distribution and plot a histogram of these q-values.
plotvalid_tbl <- freqmodel_prior_validation_tbl %>%
calculate_distribution_qvals(post_lambda, customer_lambda)
ggplot(plotvalid_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = plotvalid_tbl %>% nrow() %>% divide_by(50)), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Posterior Distribution"
)4.3 Comparing the Uniform Prior and Gamma Prior Posteriors
Now we have both posterior distributions based on the observed lifetime of the customer and it is useful to compare the two distributions.
comparison_distrib_tbl <- list(
`Flat` = freqmodel_flat_obs_validation_tbl,
`Prior` = freqmodel_prior_validation_tbl
) %>%
bind_rows(.id = "label")
comparison_distrib_tbl %>% glimpse()## Rows: 4,000,000
## Columns: 5
## $ label <chr> "Flat", "Flat", "Flat", "Flat", "Flat", "Flat", "Flat"…
## $ customer_id <chr> "00RC3IE4XG", "00RC3IE4XG", "00RC3IE4XG", "00RC3IE4XG"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ post_lambda <dbl> 0.1511800, 0.0799502, 0.2983930, 0.0345082, 0.0995931,…
## $ customer_lambda <dbl> 0.05118089, 0.05118089, 0.05118089, 0.05118089, 0.0511…
We now can compare the outputs of the two posterior distributions based on using a uniform prior and Gamma prior.
comparison_summary_tbl <- comparison_distrib_tbl %>%
group_by(label, customer_id) %>%
summarise(
.groups = "drop",
labels = c("q10", "q25", "q50", "q75", "q90", "mean", "customer_lambda"),
vals = c(quantile(post_lambda, probs = c(0.1, 0.25, 0.50, 0.75, 0.90)),
mean(post_lambda),
unique(customer_lambda))
) %>%
pivot_wider(
names_from = labels,
values_from = vals
)
comparison_summary_tbl %>% glimpse()## Rows: 2,000
## Columns: 9
## $ label <chr> "Flat", "Flat", "Flat", "Flat", "Flat", "Flat", "Flat"…
## $ customer_id <chr> "00RC3IE4XG", "02NO5FZ0VS", "02RE0JNCFQ", "04COTHYBKJ"…
## $ q10 <dbl> 0.057849420, 0.074813270, 0.887696300, 0.263768600, 0.…
## $ q25 <dbl> 0.086321350, 0.133528250, 0.966570750, 0.308264000, 0.…
## $ q50 <dbl> 0.12289250, 0.24385750, 1.05678000, 0.35788950, 0.0138…
## $ q75 <dbl> 0.17175900, 0.38584100, 1.15885000, 0.42331550, 0.0270…
## $ q90 <dbl> 0.21766680, 0.56551450, 1.25484200, 0.48155410, 0.0439…
## $ mean <dbl> 0.13285014, 0.28844774, 1.06707382, 0.36753896, 0.0193…
## $ customer_lambda <dbl> 0.051180893, 0.042939056, 0.966878916, 0.417482035, 0.…
Now that we have summary stats of the two distributions we segment our data
into the different groups of known values for lambda and then plot the
range of values for the different distributions.
plotdata_tbl <- comparison_summary_tbl %>%
mutate(
lambda_segment = case_when(
(customer_lambda > 0.00 & customer_lambda <= 0.30) ~ "SEG1",
(customer_lambda > 0.30 & customer_lambda <= 0.60) ~ "SEG2",
(customer_lambda > 0.60 & customer_lambda <= 0.90) ~ "SEG3",
(customer_lambda > 0.90 & customer_lambda <= 1.20) ~ "SEG4",
(customer_lambda > 1.20) ~ "SEG5",
TRUE ~ "ERROR"
)
) %>%
group_nest(lambda_segment) %>%
mutate(
sample_data = map(data, ~ .x %>% group_nest(customer_id) %>% slice_sample(n = 15))
) %>%
select(lambda_segment, sample_data) %>%
unnest(sample_data) %>%
unnest(data)
ggplot(plotdata_tbl %>% mutate(customer_id = str_trunc(customer_id, width = 7))) +
geom_errorbar(
aes(x = customer_id, ymin = q10, ymax = q90, colour = label),
position = position_dodge(width = 0.75), width = 0, size = 1) +
geom_errorbar(
aes(x = customer_id, ymin = q25, ymax = q75, colour = label),
position = position_dodge(width = 0.75), width = 0, size = 2) +
geom_point(
aes(x = customer_id, y = customer_lambda)) +
facet_wrap(vars(lambda_segment), scales = "free") +
labs(
x = "Customer ID",
y = "Frequency Rate",
colour = "Distribution"
) +
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5, size = 6),
legend.position = "bottom"
)And we also want to compare the two distributions for the posterior coverage.
coverage_stats_tbl <- comparison_summary_tbl %>%
mutate(
up_50 = customer_lambda > q75,
dn_50 = customer_lambda < q25,
up_80 = customer_lambda < q10,
dn_80 = customer_lambda > q90,
cover_50 = (!up_50 & !dn_50),
cover_80 = (!up_80 & !dn_80)
)
coverage_stats_tbl %>%
group_by(label) %>%
summarise(
.groups = "drop",
prop_50 = sum(cover_50) / n(),
prop_80 = sum(cover_80) / n(),
prop_up50 = sum(up_50) / n(),
prop_dn50 = sum(dn_50) / n(),
prop_up80 = sum(up_80) / n(),
prop_dn80 = sum(dn_80) / n()
)## # A tibble: 2 × 7
## label prop_50 prop_80 prop_up50 prop_dn50 prop_up80 prop_dn80
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Flat 0.438 0.73 0.125 0.437 0.23 0.04
## 2 Prior 0.476 0.779 0.224 0.3 0.134 0.087
4.4 Fitting with an Alternative Prior
We fit the model again with a narrower prior on lambda to investigate the
effect of this.
stan_modelname <- "freqmodel_prior_2"
stan_data_lst <- fit_1000_data_tbl %>%
transmute(customer_id, btyd_count, tnx_weeks) %>%
compose_data(
r = 1,
alpha = 4
)
freqmodel_prior_2_stanfit <- freqmodel_prior_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 425,
output_dir = stan_modeldir,
output_basename = stan_modelname
)## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 finished in 5.4 seconds.
## Chain 2 finished in 5.4 seconds.
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 finished in 5.5 seconds.
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 6.4 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 5.7 seconds.
## Total execution time: 6.6 seconds.
freqmodel_prior_2_stanfit$summary()## # A tibble: 1,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -5.47e+3 -5.47e+3 22.4 22.2 -5.51e+3 -5.44e+3 1.01 802.
## 2 lambda[1] 1.18e-1 1.07e-1 0.0620 0.0548 3.95e-2 2.35e-1 1.00 3015.
## 3 lambda[2] 1.83e-1 1.51e-1 0.133 0.114 3.21e-2 4.42e-1 1.00 2438.
## 4 lambda[3] 9.83e-1 9.80e-1 0.137 0.136 7.67e-1 1.22e+0 1.00 2992.
## 5 lambda[4] 3.45e-1 3.38e-1 0.0797 0.0791 2.27e-1 4.86e-1 1.00 3290.
## 6 lambda[5] 1.76e-2 1.20e-2 0.0178 0.0124 1.02e-3 5.52e-2 1.00 1890.
## 7 lambda[6] 8.07e-2 7.23e-2 0.0410 0.0385 2.78e-2 1.60e-1 1.00 2370.
## 8 lambda[7] 4.95e-1 4.91e-1 0.0975 0.0978 3.46e-1 6.64e-1 1.00 1936.
## 9 lambda[8] 2.90e-1 2.84e-1 0.0750 0.0737 1.79e-1 4.21e-1 1.01 2665.
## 10 lambda[9] 4.61e-1 4.56e-1 0.0920 0.0926 3.22e-1 6.21e-1 1.00 2315.
## # … with 991 more rows, and 1 more variable: ess_tail <dbl>
Having fitted our model we now run our HMC diagnostics.
freqmodel_prior_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/freqmodel_prior-1.csv, /home/rstudio/workshop/stan_models/freqmodel_prior-2.csv, /home/rstudio/workshop/stan_models/freqmodel_prior-3.csv, /home/rstudio/workshop/stan_models/freqmodel_prior-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
We repeat our validation exercise from before, comparing the outputs of the
posterior distribution for each lambda against the known true value.
freqmodel_prior_2_validation_tbl <- freqmodel_prior_2_stanfit %>%
recover_types(fit_1000_data_tbl) %>%
spread_draws(lambda[customer_id]) %>%
ungroup() %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
select(customer_id, draw_id = .draw, post_lambda = lambda, customer_lambda)
freqmodel_prior_2_validation_tbl %>% glimpse()## Rows: 2,000,000
## Columns: 4
## $ customer_id <chr> "00RC3IE4XG", "00RC3IE4XG", "00RC3IE4XG", "00RC3IE4XG"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ post_lambda <dbl> 0.0738737, 0.0817076, 0.1629680, 0.1713690, 0.1267950,…
## $ customer_lambda <dbl> 0.05118089, 0.05118089, 0.05118089, 0.05118089, 0.0511…
We now calculate the q-values of each parameter against the posterior distribution and plot a histogram of these q-values.
plotvalid_tbl <- freqmodel_prior_2_validation_tbl %>%
calculate_distribution_qvals(post_lambda, customer_lambda)
ggplot(plotvalid_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = plotvalid_tbl %>% nrow() %>% divide_by(50)), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Posterior Distribution"
)4.5 Construct the Comparison Plot
Once again we now construct the data to plot out the comparisons.
comparison_distrib_tbl <- list(
`Flat` = freqmodel_flat_obs_validation_tbl,
`Prior 1` = freqmodel_prior_validation_tbl,
`Prior 2` = freqmodel_prior_2_validation_tbl
) %>%
bind_rows(.id = "label")
comparison_distrib_tbl %>% glimpse()## Rows: 6,000,000
## Columns: 5
## $ label <chr> "Flat", "Flat", "Flat", "Flat", "Flat", "Flat", "Flat"…
## $ customer_id <chr> "00RC3IE4XG", "00RC3IE4XG", "00RC3IE4XG", "00RC3IE4XG"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ post_lambda <dbl> 0.1511800, 0.0799502, 0.2983930, 0.0345082, 0.0995931,…
## $ customer_lambda <dbl> 0.05118089, 0.05118089, 0.05118089, 0.05118089, 0.0511…
We now can compare the outputs of the two posterior distributions based on using a uniform prior and Gamma prior.
comparison_summary_tbl <- comparison_distrib_tbl %>%
group_by(label, customer_id) %>%
summarise(
.groups = "drop",
labels = c("q10", "q25", "q50", "q75", "q90", "mean", "customer_lambda"),
vals = c(quantile(post_lambda, probs = c(0.1, 0.25, 0.50, 0.75, 0.90)),
mean(post_lambda),
unique(customer_lambda))
) %>%
pivot_wider(
names_from = labels,
values_from = vals
)
comparison_summary_tbl %>% glimpse()## Rows: 3,000
## Columns: 9
## $ label <chr> "Flat", "Flat", "Flat", "Flat", "Flat", "Flat", "Flat"…
## $ customer_id <chr> "00RC3IE4XG", "02NO5FZ0VS", "02RE0JNCFQ", "04COTHYBKJ"…
## $ q10 <dbl> 0.057849420, 0.074813270, 0.887696300, 0.263768600, 0.…
## $ q25 <dbl> 0.086321350, 0.133528250, 0.966570750, 0.308264000, 0.…
## $ q50 <dbl> 0.12289250, 0.24385750, 1.05678000, 0.35788950, 0.0138…
## $ q75 <dbl> 0.17175900, 0.38584100, 1.15885000, 0.42331550, 0.0270…
## $ q90 <dbl> 0.21766680, 0.56551450, 1.25484200, 0.48155410, 0.0439…
## $ mean <dbl> 0.13285014, 0.28844774, 1.06707382, 0.36753896, 0.0193…
## $ customer_lambda <dbl> 0.051180893, 0.042939056, 0.966878916, 0.417482035, 0.…
Now that we have summary stats of the two distributions we segment our data
into the different groups of known values for lambda and then plot the
range of values for the different distributions.
plotdata_tbl <- comparison_summary_tbl %>%
mutate(
lambda_segment = case_when(
(customer_lambda > 0.00 & customer_lambda <= 0.30) ~ "SEG1",
(customer_lambda > 0.30 & customer_lambda <= 0.60) ~ "SEG2",
(customer_lambda > 0.60 & customer_lambda <= 0.90) ~ "SEG3",
(customer_lambda > 0.90 & customer_lambda <= 1.20) ~ "SEG4",
(customer_lambda > 1.20) ~ "SEG5",
TRUE ~ "ERROR"
)
) %>%
group_nest(lambda_segment) %>%
mutate(
sample_data = map(data, ~ .x %>% group_nest(customer_id) %>% slice_sample(n = 10))
) %>%
select(lambda_segment, sample_data) %>%
unnest(sample_data) %>%
unnest(data)
ggplot(plotdata_tbl %>% mutate(customer_id = str_trunc(customer_id, width = 7))) +
geom_errorbar(
aes(x = customer_id, ymin = q10, ymax = q90, colour = label),
position = position_dodge(width = 0.75), width = 0, size = 0.5) +
geom_errorbar(
aes(x = customer_id, ymin = q25, ymax = q75, colour = label),
position = position_dodge(width = 0.75), width = 0, size = 1) +
geom_point(
aes(x = customer_id, y = customer_lambda)) +
facet_wrap(vars(lambda_segment), scales = "free") +
labs(
x = "Customer ID",
y = "Frequency Rate",
colour = "Distribution"
) +
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5, size = 6),
legend.position = "bottom"
)And we also want to compare the two distributions for the posterior coverage.
coverage_stats_tbl <- comparison_summary_tbl %>%
mutate(
up_50 = customer_lambda > q75,
dn_50 = customer_lambda < q25,
up_80 = customer_lambda < q10,
dn_80 = customer_lambda > q90,
cover_50 = (!up_50 & !dn_50),
cover_80 = (!up_80 & !dn_80)
)
coverage_stats_tbl %>%
group_by(label) %>%
summarise(
.groups = "drop",
prop_50 = sum(cover_50) / n(),
prop_80 = sum(cover_80) / n(),
prop_up50 = sum(up_50) / n(),
prop_dn50 = sum(dn_50) / n(),
prop_up80 = sum(up_80) / n(),
prop_dn80 = sum(dn_80) / n()
)## # A tibble: 3 × 7
## label prop_50 prop_80 prop_up50 prop_dn50 prop_up80 prop_dn80
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Flat 0.438 0.73 0.125 0.437 0.23 0.04
## 2 Prior 1 0.476 0.779 0.224 0.3 0.134 0.087
## 3 Prior 2 0.468 0.78 0.2 0.332 0.145 0.075
5 Validating the Frequency Models
We now turn our attention to validating our model fits.
For the purposes of this work, we work on the dataset of 10,000 customers over a subset of the time, then use the model to predict the transaction count for the remainder of the time period. This should help us validate the quality of our model.
For the purposes of this work, we fit our model on the data from the first nine months of the year.
break_date <- as.Date("2019-09-30")
full_weeks <- difftime(break_date, as.Date("2019-01-01"), units = "weeks") %>% as.numeric()
training_tbl <- customer_transactions_tbl %>%
filter(
customer_id %in% id_10000,
tnx_timestamp <= break_date
)
training_tbl %>% glimpse()## Rows: 140,766
## Columns: 5
## $ customer_id <chr> "000JSNK3XH", "000JSNK3XH", "000JSNK3XH", "000JSNK3XH", …
## $ cohort_qtr <chr> "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "…
## $ cohort_ym <chr> "2019 01", "2019 01", "2019 01", "2019 01", "2019 01", "…
## $ tnx_timestamp <dttm> 2019-01-01 19:52:38, 2019-01-16 20:04:48, 2019-01-23 23…
## $ tnx_amount <dbl> 234.83, 213.38, 214.90, 180.31, 234.99, 245.72, 226.07, …
test_tbl <- customer_transactions_tbl %>%
filter(
customer_id %in% id_10000,
tnx_timestamp > break_date
)
test_tbl %>% glimpse()## Rows: 45,046
## Columns: 5
## $ customer_id <chr> "000JSNK3XH", "000JSNK3XH", "000JSNK3XH", "000JSNK3XH", …
## $ cohort_qtr <chr> "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "…
## $ cohort_ym <chr> "2019 01", "2019 01", "2019 01", "2019 01", "2019 01", "…
## $ tnx_timestamp <dttm> 2019-10-04 05:57:29, 2019-10-08 04:26:34, 2019-10-13 11…
## $ tnx_amount <dbl> 227.54, 222.65, 224.33, 225.61, 210.94, 242.21, 242.29, …
We need to reconstruct the summary statistics
training_stats_tbl <- training_tbl %>%
calculate_transaction_summary_stats() %>%
transmute(
customer_id, tnx_weeks = full_weeks, btyd_count
)
training_stats_tbl %>% glimpse()## Rows: 10,000
## Columns: 3
## $ customer_id <chr> "000JSNK3XH", "004LAX8P44", "0068WIE4X2", "00O7MNGPZC", "0…
## $ tnx_weeks <dbl> 38.85714, 38.85714, 38.85714, 38.85714, 38.85714, 38.85714…
## $ btyd_count <dbl> 19, 10, 6, 34, 6, 3, 4, 2, 0, 5, 17, 5, 1, 5, 7, 7, 10, 15…
test_stats_tbl <- test_tbl %>% calculate_transaction_summary_stats()
test_stats_tbl %>% glimpse()## Rows: 8,236
## Columns: 9
## $ customer_id <chr> "000JSNK3XH", "004LAX8P44", "0068WIE4X2", "00O7MNGPZC", "…
## $ tnx_count <int> 9, 5, 2, 8, 4, 1, 1, 1, 8, 1, 3, 1, 3, 11, 3, 3, 1, 11, 7…
## $ first_tnx_ts <dttm> 2019-10-04 05:57:29, 2019-10-20 16:47:27, 2019-10-22 16:…
## $ last_tnx_ts <dttm> 2019-11-19 10:48:17, 2019-12-20 23:24:14, 2019-11-15 09:…
## $ btyd_count <dbl> 8, 4, 1, 7, 3, 0, 0, 0, 7, 0, 2, 0, 2, 10, 2, 2, 0, 10, 6…
## $ all_weeks <dbl> 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 5…
## $ tnx_weeks <dbl> 6.600277, 8.753649, 3.383628, 12.293109, 5.683321, 52.000…
## $ obs_freq <dbl> 1.2120703, 0.4569523, 0.2955407, 0.5694247, 0.5278604, 0.…
## $ emp_freq <dbl> 0.15384615, 0.07692308, 0.01923077, 0.13461538, 0.0576923…
5.1 Validating the Flat Model Fit
Now that we have set up our datasets we fit our model on the first nine months of the year.
This is the exact same model as before, so we do not need to recompile the Stan model, we just fit it with the subsetted data.
stan_modelname <- "freqmodel_flat_subset"
stan_data_lst <- training_stats_tbl %>%
compose_data()
freqmodel_flat_subset_stanfit <- freqmodel_flat_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 424,
output_dir = stan_modeldir,
output_basename = stan_modelname
)## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 80.7 seconds.
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 finished in 82.0 seconds.
## Chain 1 finished in 82.5 seconds.
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 83.0 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 82.0 seconds.
## Total execution time: 83.3 seconds.
freqmodel_flat_subset_stanfit$summary()## # A tibble: 10,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -4.02e+4 -4.02e+4 75.8 73.2 -4.04e+4 -4.01e+4 1.00 851.
## 2 lambda[1] 5.13e-1 5.06e-1 0.114 0.112 3.42e-1 7.25e-1 1.01 3043.
## 3 lambda[2] 2.85e-1 2.76e-1 0.0849 0.0790 1.61e-1 4.37e-1 1.00 3908.
## 4 lambda[3] 1.79e-1 1.70e-1 0.0664 0.0652 8.41e-2 2.98e-1 1.01 2711.
## 5 lambda[4] 9.00e-1 8.90e-1 0.154 0.144 6.65e-1 1.17e+0 1.00 3732.
## 6 lambda[5] 1.81e-1 1.73e-1 0.0691 0.0666 8.26e-2 3.09e-1 1.00 3136.
## 7 lambda[6] 1.03e-1 9.28e-2 0.0531 0.0499 3.30e-2 2.02e-1 1.00 3100.
## 8 lambda[7] 1.28e-1 1.20e-1 0.0561 0.0516 5.02e-2 2.29e-1 1.00 2893.
## 9 lambda[8] 7.74e-2 6.91e-2 0.0458 0.0434 1.98e-2 1.63e-1 1.00 3302.
## 10 lambda[9] 2.54e-2 1.78e-2 0.0253 0.0183 1.09e-3 7.26e-2 1.00 2258.
## # … with 9,991 more rows, and 1 more variable: ess_tail <dbl>
We need to check the HMC diagnostics for this fit.
freqmodel_flat_subset_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/freqmodel_flat_subset-1.csv, /home/rstudio/workshop/stan_models/freqmodel_flat_subset-2.csv, /home/rstudio/workshop/stan_models/freqmodel_flat_subset-3.csv, /home/rstudio/workshop/stan_models/freqmodel_flat_subset-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
We use our fitted model to predict the transaction frequency for each customer, then use this to predict a range of transaction counts for each customer in the subsequent three months, finally comparing it to the observed counts.
tnx_window <- difftime(as.Date("2019-12-31"), as.Date("2019-10-01"), units = "weeks") %>%
as.numeric()
freqmodel_flat_subset_predict_tbl <- freqmodel_flat_subset_stanfit %>%
recover_types(training_stats_tbl) %>%
spread_draws(lambda[customer_id]) %>%
ungroup() %>%
mutate(
pred_count = rpois(n(), lambda = lambda * tnx_window)
)
freqmodel_flat_subset_predict_tbl %>% glimpse()## Rows: 20,000,000
## Columns: 6
## $ customer_id <chr> "000JSNK3XH", "000JSNK3XH", "000JSNK3XH", "000JSNK3XH", "0…
## $ lambda <dbl> 0.586582, 0.434214, 0.602092, 0.287689, 0.512740, 0.499130…
## $ .chain <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ .iteration <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ .draw <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ pred_count <int> 8, 7, 3, 7, 5, 6, 8, 11, 10, 1, 8, 6, 2, 9, 7, 8, 7, 6, 2,…
As we did before, we now combine these predicted transaction counts with what was observed the dataset and we see if the q-values give us a uniform distribution across customers.
freqmodel_flat_predict_valid_tbl <- freqmodel_flat_subset_predict_tbl %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
left_join(test_stats_tbl, by = "customer_id") %>%
select(customer_id, draw_id = .draw, customer_lambda, lambda, pred_count, obs_count = btyd_count) %>%
replace_na(list(obs_count = 0))
freqmodel_flat_predict_valid_tbl %>% glimpse()## Rows: 20,000,000
## Columns: 6
## $ customer_id <chr> "000JSNK3XH", "000JSNK3XH", "000JSNK3XH", "000JSNK3XH"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ customer_lambda <dbl> 0.6374628, 0.6374628, 0.6374628, 0.6374628, 0.6374628,…
## $ lambda <dbl> 0.586582, 0.434214, 0.602092, 0.287689, 0.512740, 0.49…
## $ pred_count <int> 8, 7, 3, 7, 5, 6, 8, 11, 10, 1, 8, 6, 2, 9, 7, 8, 7, 6…
## $ obs_count <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, …
We now use this data to construct a histogram of the q-vals and compare that with what we would expect from a uniform distribution.
We first start by comparing the posterior distribution of lambda with the
known underlying value we used to generate the data.
plotvalid_tbl <- freqmodel_flat_predict_valid_tbl %>%
calculate_distribution_qvals(lambda, customer_lambda, customer_id)
ggplot(plotvalid_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = plotvalid_tbl %>% nrow() %>% divide_by(50)), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for Simulated Transaction Frequencies"
)We can also repeat this analysis for the observed transaction counts.
plotvalid_tbl <- freqmodel_flat_predict_valid_tbl %>%
calculate_distribution_qvals(pred_count, obs_count, customer_id, customer_lambda)
ggplot(plotvalid_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = plotvalid_tbl %>% nrow() %>% divide_by(50)), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for Simulated Transaction Counts"
)As this work will be repeated for multiple models, we wrap all the above logic
into a function validate_frequency_model to perform the various steps
required to construct the validation data, and we need some tables of data
set up as well.
customer_lambda_tbl <- customer_simparams_tbl %>%
select(customer_id, customer_lambda)
obs_data_tbl <- test_stats_tbl %>%
select(customer_id, obs_count = tnx_count)
input_validation_tbl <- training_stats_tbl %>%
inner_join(customer_lambda_tbl, by = "customer_id") %>%
left_join(obs_data_tbl, by = "customer_id") %>%
replace_na(
list(obs_count = 0)
) %>%
transmute(
customer_id = customer_id,
customer_lambda = customer_lambda,
tnx_window = tnx_window,
obs_count = obs_count
)
input_validation_tbl %>% glimpse()## Rows: 10,000
## Columns: 4
## $ customer_id <chr> "000JSNK3XH", "004LAX8P44", "0068WIE4X2", "00O7MNGPZC"…
## $ customer_lambda <dbl> 0.63746283, 0.20599461, 0.18019778, 0.78503589, 0.1868…
## $ tnx_window <dbl> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13…
## $ obs_count <dbl> 9, 5, 2, 8, 4, 0, 1, 1, 0, 1, 8, 1, 0, 3, 1, 3, 0, 11,…
5.2 Validating Out-of-Sample Frequency Models
As we will be repeating the above logic for each of the models constructed, we
write a function called validate_frequency_model to construct this data and
allow us to compare multiple models at once.
We now need to refit these models with our ‘training’ data like we did before
with our flat model in the previous section.
5.2.1 Gamma(0.25, 1) Prior
We refit our models with a Gamma prior with fixed parameters, first fitting with a \(\Gamma(0.25, 1)\) prior.
stan_modelname <- "freqmodel_prior_subset"
stan_data_lst <- training_stats_tbl %>%
compose_data(
r = 0.25,
alpha = 1
)
freqmodel_prior_subset_stanfit <- freqmodel_prior_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 424,
output_dir = stan_modeldir,
output_basename = stan_modelname
)## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 214.9 seconds.
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 217.4 seconds.
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 218.9 seconds.
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 finished in 224.5 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 218.9 seconds.
## Total execution time: 225.1 seconds.
freqmodel_prior_subset_stanfit$summary()## # A tibble: 10,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -3.19e+4 -3.19e+4 78.5 7.94e+1 -3.21e+4 -3.18e+4 1.00 683.
## 2 lambda[1] 4.87e-1 4.78e-1 0.111 1.09e-1 3.21e-1 6.86e-1 1.00 4746.
## 3 lambda[2] 2.58e-1 2.50e-1 0.0818 8.15e-2 1.39e-1 4.07e-1 1.00 4082.
## 4 lambda[3] 1.58e-1 1.49e-1 0.0664 6.30e-2 6.71e-2 2.79e-1 0.999 3488.
## 5 lambda[4] 8.59e-1 8.52e-1 0.147 1.44e-1 6.33e-1 1.12e+0 1.00 4004.
## 6 lambda[5] 1.57e-1 1.49e-1 0.0651 6.18e-2 6.58e-2 2.77e-1 1.00 3676.
## 7 lambda[6] 8.07e-2 7.28e-2 0.0441 3.98e-2 2.40e-2 1.63e-1 1.00 3241.
## 8 lambda[7] 1.06e-1 9.72e-2 0.0522 4.98e-2 3.59e-2 2.05e-1 1.00 3194.
## 9 lambda[8] 5.58e-2 4.75e-2 0.0372 3.20e-2 1.15e-2 1.27e-1 1.00 3640.
## 10 lambda[9] 6.15e-3 1.09e-3 0.0115 1.61e-3 2.29e-7 2.89e-2 1.00 1507.
## # … with 9,991 more rows, and 1 more variable: ess_tail <dbl>
Having fit with this data, we need to check the HMC diagnostics.
freqmodel_prior_subset_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/freqmodel_prior_subset-1.csv, /home/rstudio/workshop/stan_models/freqmodel_prior_subset-2.csv, /home/rstudio/workshop/stan_models/freqmodel_prior_subset-3.csv, /home/rstudio/workshop/stan_models/freqmodel_prior_subset-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
5.2.2 Gamma(1, 4) Prior
We also refit the model with the less diffused \(\Gamma(1, 4)\) prior.
stan_modelname <- "freqmodel_prior_2_subset"
stan_data_lst <- training_stats_tbl %>%
compose_data(
r = 1,
alpha = 4
)
freqmodel_prior_2_subset_stanfit <- freqmodel_prior_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 424,
output_dir = stan_modeldir,
output_basename = stan_modelname
)## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 finished in 94.5 seconds.
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 95.3 seconds.
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 95.6 seconds.
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 95.8 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 95.3 seconds.
## Total execution time: 96.0 seconds.
freqmodel_prior_2_subset_stanfit$summary()## # A tibble: 10,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -5.40e+4 -5.40e+4 75.7 77.8 -5.42e+4 -5.39e+4 1.00 771.
## 2 lambda[1] 4.65e-1 4.56e-1 0.103 0.0968 3.10e-1 6.52e-1 1.00 3146.
## 3 lambda[2] 2.57e-1 2.48e-1 0.0782 0.0742 1.42e-1 4.00e-1 1.00 2921.
## 4 lambda[3] 1.64e-1 1.56e-1 0.0628 0.0607 7.41e-2 2.75e-1 1.01 2197.
## 5 lambda[4] 8.18e-1 8.11e-1 0.142 0.144 5.98e-1 1.06e+0 1.00 4082.
## 6 lambda[5] 1.63e-1 1.54e-1 0.0611 0.0570 8.01e-2 2.72e-1 1.00 2634.
## 7 lambda[6] 9.29e-2 8.45e-2 0.0474 0.0434 3.28e-2 1.80e-1 1.00 3271.
## 8 lambda[7] 1.15e-1 1.09e-1 0.0516 0.0487 4.47e-2 2.10e-1 0.999 2136.
## 9 lambda[8] 6.85e-2 6.20e-2 0.0384 0.0336 1.95e-2 1.40e-1 0.999 2956.
## 10 lambda[9] 2.34e-2 1.64e-2 0.0238 0.0167 1.00e-3 6.71e-2 1.00 2601.
## # … with 9,991 more rows, and 1 more variable: ess_tail <dbl>
And as before we check the HMC diagnostics.
freqmodel_prior_2_subset_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/freqmodel_prior_2_subset-1.csv, /home/rstudio/workshop/stan_models/freqmodel_prior_2_subset-2.csv, /home/rstudio/workshop/stan_models/freqmodel_prior_2_subset-3.csv, /home/rstudio/workshop/stan_models/freqmodel_prior_2_subset-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
5.2.3 Construct Model Comparison Plots
We now use each of these fitted models to construct a comparison dataset,
comparing both the known lambda value and observed transaction count against
the posterior distributions for both those values.
fit_comparison_tbl <- list(
`Flat Prior` = freqmodel_flat_subset_stanfit,
`Gamma(0.25, 1)` = freqmodel_prior_subset_stanfit,
`Gamma(1, 4)` = freqmodel_prior_2_subset_stanfit
) %>%
enframe(name = "model_label", value = "stanfit") %>%
mutate(
valid_data = map(
stanfit, validate_frequency_model,
input_data_tbl = input_validation_tbl
)
)
fit_comparison_tbl %>% glimpse() ## Rows: 3
## Columns: 3
## $ model_label <chr> "Flat Prior", "Gamma(0.25, 1)", "Gamma(1, 4)"
## $ stanfit <list> <<CmdStanMCMC>
## Inherits from: <CmdStanFit>
## Public:
## clone: function (deep = FALSE)
## cmdstan_diagnose: function ()
## cmdstan_summary: function (flags = NULL)
## data_file: function ()
## draws: function (variables = NULL, inc_warmup = FALSE, format = getOption("cmdstanr_draws_format",
## init: function ()
## initialize: function (runset)
## inv_metric: function (matrix = TRUE)
## latent_dynamics_files: function (include_failed = FALSE)
## loo: function (variables = "log_lik", r_eff = TRUE, ...)
## lp: function ()
## metadata: function ()
## num_chains: function ()
## num_procs: function ()
## output: function (id = NULL)
## output_files: function (include_failed = FALSE)
## print: function (variables = NULL, ..., digits = 2, max_rows = getOption("cmdstanr_max_rows",
## profile_files: function (include_failed = FALSE)
## profiles: function ()
## return_codes: function ()
## runset: CmdStanRun, R6
## sampler_diagnostics: function (inc_warmup = FALSE, format = getOption("cmdstanr_draws_format",
## save_data_file: function (dir = ".", basename = NULL, timestamp = TRUE, random = TRUE)
## save_latent_dynamics_files: function (dir = ".", basename = NULL, timestamp = TRUE, random = TRUE)
## save_object: function (file, ...)
## save_output_files: function (dir = ".", basename = NULL, timestamp = TRUE, random = TRUE)
## save_profile_files: function (dir = ".", basename = NULL, timestamp = TRUE, random = TRUE)
## summary: function (variables = NULL, ...)
## time: function ()
## Private:
## draws_: -40344.6 -40238.6 -40302.5 -40393.9 -40410.3 -40267.9 -4 ...
## init_: NULL
## inv_metric_: list
## metadata_: list
## read_csv_: function (variables = NULL, sampler_diagnostics = NULL, format = getOption("cmdstanr_draws_format",
## sampler_diagnostics_: 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 ...
## warmup_draws_: NULL
## warmup_sampler_diagnostics_: NULL>, <<CmdStanMCMC>
## Inherits from: <CmdStanFit>
## Public:
## clone: function (deep = FALSE)
## cmdstan_diagnose: function ()
## cmdstan_summary: function (flags = NULL)
## data_file: function ()
## draws: function (variables = NULL, inc_warmup = FALSE, format = getOption("cmdstanr_draws_format",
## init: function ()
## initialize: function (runset)
## inv_metric: function (matrix = TRUE)
## latent_dynamics_files: function (include_failed = FALSE)
## loo: function (variables = "log_lik", r_eff = TRUE, ...)
## lp: function ()
## metadata: function ()
## num_chains: function ()
## num_procs: function ()
## output: function (id = NULL)
## output_files: function (include_failed = FALSE)
## print: function (variables = NULL, ..., digits = 2, max_rows = getOption("cmdstanr_max_rows",
## profile_files: function (include_failed = FALSE)
## profiles: function ()
## return_codes: function ()
## runset: CmdStanRun, R6
## sampler_diagnostics: function (inc_warmup = FALSE, format = getOption("cmdstanr_draws_format",
## save_data_file: function (dir = ".", basename = NULL, timestamp = TRUE, random = TRUE)
## save_latent_dynamics_files: function (dir = ".", basename = NULL, timestamp = TRUE, random = TRUE)
## save_object: function (file, ...)
## save_output_files: function (dir = ".", basename = NULL, timestamp = TRUE, random = TRUE)
## save_profile_files: function (dir = ".", basename = NULL, timestamp = TRUE, random = TRUE)
## summary: function (variables = NULL, ...)
## time: function ()
## Private:
## draws_: -32036.9 -31916.3 -31914.5 -31837.8 -31827.6 -31879.9 -3 ...
## init_: NULL
## inv_metric_: list
## metadata_: list
## read_csv_: function (variables = NULL, sampler_diagnostics = NULL, format = getOption("cmdstanr_draws_format",
## sampler_diagnostics_: 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 ...
## warmup_draws_: NULL
## warmup_sampler_diagnostics_: NULL>, <<CmdStanMCMC>
## Inherits from: <CmdStanFit>
## Public:
## clone: function (deep = FALSE)
## cmdstan_diagnose: function ()
## cmdstan_summary: function (flags = NULL)
## data_file: function ()
## draws: function (variables = NULL, inc_warmup = FALSE, format = getOption("cmdstanr_draws_format",
## init: function ()
## initialize: function (runset)
## inv_metric: function (matrix = TRUE)
## latent_dynamics_files: function (include_failed = FALSE)
## loo: function (variables = "log_lik", r_eff = TRUE, ...)
## lp: function ()
## metadata: function ()
## num_chains: function ()
## num_procs: function ()
## output: function (id = NULL)
## output_files: function (include_failed = FALSE)
## print: function (variables = NULL, ..., digits = 2, max_rows = getOption("cmdstanr_max_rows",
## profile_files: function (include_failed = FALSE)
## profiles: function ()
## return_codes: function ()
## runset: CmdStanRun, R6
## sampler_diagnostics: function (inc_warmup = FALSE, format = getOption("cmdstanr_draws_format",
## save_data_file: function (dir = ".", basename = NULL, timestamp = TRUE, random = TRUE)
## save_latent_dynamics_files: function (dir = ".", basename = NULL, timestamp = TRUE, random = TRUE)
## save_object: function (file, ...)
## save_output_files: function (dir = ".", basename = NULL, timestamp = TRUE, random = TRUE)
## save_profile_files: function (dir = ".", basename = NULL, timestamp = TRUE, random = TRUE)
## summary: function (variables = NULL, ...)
## time: function ()
## Private:
## draws_: -54041.5 -54079.4 -54306.5 -54182.7 -54064.2 -54029.2 -5 ...
## init_: NULL
## inv_metric_: list
## metadata_: list
## read_csv_: function (variables = NULL, sampler_diagnostics = NULL, format = getOption("cmdstanr_draws_format",
## sampler_diagnostics_: 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 ...
## warmup_draws_: NULL
## warmup_sampler_diagnostics_: NULL>
## $ valid_data <list> [<tbl_df[20000000 x 6]>], [<tbl_df[20000000 x 6]>], [<tbl_…
We now now use these datasets and our other validation routines to check the
q-vals for both lambda and transaction count.
validation_data_tbl <- fit_comparison_tbl %>%
select(model_label, valid_data) %>%
unnest(valid_data)
validation_data_tbl %>% glimpse()## Rows: 60,000,000
## Columns: 7
## $ model_label <chr> "Flat Prior", "Flat Prior", "Flat Prior", "Flat Prior"…
## $ customer_id <chr> "000JSNK3XH", "000JSNK3XH", "000JSNK3XH", "000JSNK3XH"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ customer_lambda <dbl> 0.6374628, 0.6374628, 0.6374628, 0.6374628, 0.6374628,…
## $ post_lambda <dbl> 0.586582, 0.434214, 0.602092, 0.287689, 0.512740, 0.49…
## $ pred_count <int> 3, 11, 4, 5, 10, 8, 7, 8, 7, 5, 8, 9, 5, 7, 5, 3, 11, …
## $ obs_count <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, …
First we plot the comparison histograms for lambda:
valid_lambda_tbl <- validation_data_tbl %>%
calculate_distribution_qvals(post_lambda, customer_lambda, model_label, customer_id)
ggplot(valid_lambda_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = 200), colour = "red") +
facet_wrap(vars(model_label), ncol = 1) +
labs(
x = "Quantile",
y = "Count",
title = "Comparison Histogram of q-vals for Lambda"
)valid_count_tbl <- validation_data_tbl %>%
calculate_distribution_qvals(pred_count, obs_count, model_label, customer_id)
ggplot(valid_count_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = 200), colour = "red") +
facet_wrap(vars(model_label), ncol = 1) +
labs(
x = "Quantile",
y = "Count",
title = "Comparison Histogram of q-vals for Count"
)6 Write Data to Disk
We now want to write a number of these values to disk
id_1000 %>% write_rds("data/id_1000.rds")
id_5000 %>% write_rds("data/id_5000.rds")
id_10000 %>% write_rds("data/id_10000.rds")
customer_summarystats_tbl %>% write_rds("data/synthdata_singleyear_summarystats_tbl.rds")
validation_data_tbl %>% write_rds("data/synthdata_singleyear_subset_comparison_distrib_tbl.rds")7 R Environment
options(width = 120L)
sessioninfo::session_info()## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
## setting value
## version R version 4.1.1 (2021-08-10)
## os Ubuntu 20.04.3 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2022-02-08
##
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## package * version date lib source
## abind 1.4-5 2016-07-21 [1] RSPM (R 4.1.0)
## arrayhelpers 1.1-0 2020-02-04 [1] RSPM (R 4.1.0)
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.1.0)
## backports 1.3.0 2021-10-27 [1] RSPM (R 4.1.0)
## base64enc 0.1-3 2015-07-28 [1] RSPM (R 4.1.0)
## bayesplot * 1.8.1 2021-06-14 [1] RSPM (R 4.1.0)
## bit 4.0.4 2020-08-04 [1] RSPM (R 4.1.0)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.1.0)
## bookdown 0.24 2021-09-02 [1] RSPM (R 4.1.0)
## boot 1.3-28 2021-05-03 [2] CRAN (R 4.1.1)
## bridgesampling 1.1-2 2021-04-16 [1] RSPM (R 4.1.0)
## brms * 2.16.1 2022-01-24 [1] Github (paul-buerkner/brms@6fa5c02)
## Brobdingnag 1.2-6 2018-08-13 [1] RSPM (R 4.1.0)
## broom 0.7.9 2021-07-27 [1] RSPM (R 4.1.0)
## bslib 0.3.1 2021-10-06 [1] RSPM (R 4.1.0)
## cachem 1.0.6 2021-08-19 [1] RSPM (R 4.1.0)
## callr 3.7.0 2021-04-20 [1] RSPM (R 4.1.0)
## cellranger 1.1.0 2016-07-27 [1] RSPM (R 4.1.0)
## checkmate 2.0.0 2020-02-06 [1] RSPM (R 4.1.0)
## cli 3.1.0 2021-10-27 [1] RSPM (R 4.1.0)
## cmdstanr * 0.4.0 2022-01-24 [1] Github (stan-dev/cmdstanr@3233585)
## coda 0.19-4 2020-09-30 [1] RSPM (R 4.1.0)
## codetools 0.2-18 2020-11-04 [2] CRAN (R 4.1.1)
## colorspace 2.0-2 2021-06-24 [1] RSPM (R 4.1.0)
## colourpicker 1.1.1 2021-10-04 [1] RSPM (R 4.1.0)
## conflicted * 1.0.4 2019-06-21 [1] RSPM (R 4.1.0)
## cowplot * 1.1.1 2020-12-30 [1] RSPM (R 4.1.0)
## crayon 1.4.1 2021-02-08 [1] RSPM (R 4.1.0)
## crosstalk 1.1.1 2021-01-12 [1] RSPM (R 4.1.0)
## curl 4.3.2 2021-06-23 [1] RSPM (R 4.1.0)
## data.table 1.14.2 2021-09-27 [1] RSPM (R 4.1.0)
## DBI 1.1.1 2021-01-15 [1] RSPM (R 4.1.0)
## dbplyr 2.1.1 2021-04-06 [1] RSPM (R 4.1.0)
## digest 0.6.28 2021-09-23 [1] RSPM (R 4.1.0)
## directlabels * 2021.1.13 2021-01-16 [1] RSPM (R 4.1.0)
## distributional 0.2.2 2021-02-02 [1] RSPM (R 4.1.0)
## dplyr * 1.0.7 2021-06-18 [1] RSPM (R 4.1.0)
## DT 0.19 2021-09-02 [1] RSPM (R 4.1.0)
## dygraphs 1.1.1.6 2018-07-11 [1] RSPM (R 4.1.0)
## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.1.0)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.1.0)
## fansi 0.5.0 2021-05-25 [1] RSPM (R 4.1.0)
## farver 2.1.0 2021-02-28 [1] RSPM (R 4.1.0)
## fastmap 1.1.0 2021-01-25 [1] RSPM (R 4.1.0)
## forcats * 0.5.1 2021-01-27 [1] RSPM (R 4.1.0)
## fs * 1.5.0 2020-07-31 [1] RSPM (R 4.1.0)
## furrr * 0.2.3 2021-06-25 [1] RSPM (R 4.1.0)
## future * 1.22.1 2021-08-25 [1] RSPM (R 4.1.0)
## gamm4 0.2-6 2020-04-03 [1] RSPM (R 4.1.0)
## generics 0.1.1 2021-10-25 [1] RSPM (R 4.1.0)
## ggdist 3.0.0 2021-07-19 [1] RSPM (R 4.1.0)
## ggplot2 * 3.3.5 2021-06-25 [1] RSPM (R 4.1.0)
## ggridges 0.5.3 2021-01-08 [1] RSPM (R 4.1.0)
## globals 0.14.0 2020-11-22 [1] RSPM (R 4.1.0)
## glue * 1.4.2 2020-08-27 [1] RSPM (R 4.1.0)
## gridExtra 2.3 2017-09-09 [1] RSPM (R 4.1.0)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.1.0)
## gtools 3.9.2 2021-06-06 [1] RSPM (R 4.1.0)
## haven 2.4.3 2021-08-04 [1] RSPM (R 4.1.0)
## highr 0.9 2021-04-16 [1] RSPM (R 4.1.0)
## hms 1.1.1 2021-09-26 [1] RSPM (R 4.1.0)
## htmltools 0.5.2 2021-08-25 [1] RSPM (R 4.1.0)
## htmlwidgets 1.5.4 2021-09-08 [1] RSPM (R 4.1.0)
## httpuv 1.6.3 2021-09-09 [1] RSPM (R 4.1.0)
## httr 1.4.2 2020-07-20 [1] RSPM (R 4.1.0)
## igraph 1.2.7 2021-10-15 [1] RSPM (R 4.1.0)
## inline 0.3.19 2021-05-31 [1] RSPM (R 4.1.0)
## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.1.0)
## jsonlite 1.7.2 2020-12-09 [1] RSPM (R 4.1.0)
## knitr 1.36 2021-09-29 [1] RSPM (R 4.1.0)
## labeling 0.4.2 2020-10-20 [1] RSPM (R 4.1.0)
## later 1.3.0 2021-08-18 [1] RSPM (R 4.1.0)
## lattice 0.20-44 2021-05-02 [2] CRAN (R 4.1.1)
## lifecycle 1.0.1 2021-09-24 [1] RSPM (R 4.1.0)
## listenv 0.8.0 2019-12-05 [1] RSPM (R 4.1.0)
## lme4 1.1-27.1 2021-06-22 [1] RSPM (R 4.1.0)
## loo 2.4.1 2020-12-09 [1] RSPM (R 4.1.0)
## lubridate 1.8.0 2021-10-07 [1] RSPM (R 4.1.0)
## magrittr * 2.0.1 2020-11-17 [1] RSPM (R 4.1.0)
## markdown 1.1 2019-08-07 [1] RSPM (R 4.1.0)
## MASS 7.3-54 2021-05-03 [2] CRAN (R 4.1.1)
## Matrix 1.3-4 2021-06-01 [2] CRAN (R 4.1.1)
## matrixStats 0.61.0 2021-09-17 [1] RSPM (R 4.1.0)
## mgcv 1.8-36 2021-06-01 [2] CRAN (R 4.1.1)
## mime 0.12 2021-09-28 [1] RSPM (R 4.1.0)
## miniUI 0.1.1.1 2018-05-18 [1] RSPM (R 4.1.0)
## minqa 1.2.4 2014-10-09 [1] RSPM (R 4.1.0)
## modelr 0.1.8 2020-05-19 [1] RSPM (R 4.1.0)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.1.0)
## mvtnorm 1.1-3 2021-10-08 [1] RSPM (R 4.1.0)
## nlme 3.1-152 2021-02-04 [2] CRAN (R 4.1.1)
## nloptr 1.2.2.2 2020-07-02 [1] RSPM (R 4.1.0)
## parallelly 1.28.1 2021-09-09 [1] RSPM (R 4.1.0)
## pillar 1.6.4 2021-10-18 [1] RSPM (R 4.1.0)
## pkgbuild 1.2.0 2020-12-15 [1] RSPM (R 4.1.0)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.1.0)
## plyr 1.8.6 2020-03-03 [1] RSPM (R 4.1.0)
## posterior * 1.1.0 2021-09-09 [1] RSPM (R 4.1.0)
## prettyunits 1.1.1 2020-01-24 [1] RSPM (R 4.1.0)
## processx 3.5.2 2021-04-30 [1] RSPM (R 4.1.0)
## projpred 2.0.2 2020-10-28 [1] RSPM (R 4.1.0)
## promises 1.2.0.1 2021-02-11 [1] RSPM (R 4.1.0)
## ps 1.6.0 2021-02-28 [1] RSPM (R 4.1.0)
## purrr * 0.3.4 2020-04-17 [1] RSPM (R 4.1.0)
## quadprog 1.5-8 2019-11-20 [1] RSPM (R 4.1.0)
## R6 2.5.1 2021-08-19 [1] RSPM (R 4.1.0)
## Rcpp * 1.0.7 2021-07-07 [1] RSPM (R 4.1.0)
## RcppParallel 5.1.4 2021-05-04 [1] RSPM (R 4.1.0)
## readr * 2.0.2 2021-09-27 [1] RSPM (R 4.1.0)
## readxl 1.3.1 2019-03-13 [1] RSPM (R 4.1.0)
## reprex 2.0.1 2021-08-05 [1] RSPM (R 4.1.0)
## reshape2 1.4.4 2020-04-09 [1] RSPM (R 4.1.0)
## rlang * 0.4.12 2021-10-18 [1] RSPM (R 4.1.0)
## rmarkdown 2.11 2021-09-14 [1] RSPM (R 4.1.0)
## rmdformats 1.0.3 2021-10-06 [1] RSPM (R 4.1.0)
## rsconnect 0.8.24 2021-08-05 [1] RSPM (R 4.1.0)
## rstan 2.21.2 2020-07-27 [1] RSPM (R 4.1.0)
## rstantools 2.1.1 2020-07-06 [1] RSPM (R 4.1.0)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.1.0)
## rvest 1.0.2 2021-10-16 [1] RSPM (R 4.1.0)
## sass 0.4.0 2021-05-12 [1] RSPM (R 4.1.0)
## scales * 1.1.1 2020-05-11 [1] RSPM (R 4.1.0)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.1.0)
## shiny 1.7.1 2021-10-02 [1] RSPM (R 4.1.0)
## shinyjs 2.0.0 2020-09-09 [1] RSPM (R 4.1.0)
## shinystan 2.5.0 2018-05-01 [1] RSPM (R 4.1.0)
## shinythemes 1.2.0 2021-01-25 [1] RSPM (R 4.1.0)
## StanHeaders 2.26.6 2022-01-24 [1] local
## stringi 1.7.5 2021-10-04 [1] RSPM (R 4.1.0)
## stringr * 1.4.0 2019-02-10 [1] RSPM (R 4.1.0)
## svUnit 1.0.6 2021-04-19 [1] RSPM (R 4.1.0)
## tensorA 0.36.2 2020-11-19 [1] RSPM (R 4.1.0)
## threejs 0.3.3 2020-01-21 [1] RSPM (R 4.1.0)
## tibble * 3.1.5 2021-09-30 [1] RSPM (R 4.1.0)
## tidybayes * 3.0.1 2021-08-22 [1] RSPM (R 4.1.0)
## tidyr * 1.1.4 2021-09-27 [1] RSPM (R 4.1.0)
## tidyselect 1.1.1 2021-04-30 [1] RSPM (R 4.1.0)
## tidyverse * 1.3.1 2021-04-15 [1] RSPM (R 4.1.0)
## tzdb 0.2.0 2021-10-27 [1] RSPM (R 4.1.0)
## utf8 1.2.2 2021-07-24 [1] RSPM (R 4.1.0)
## V8 3.4.2 2021-05-01 [1] RSPM (R 4.1.0)
## vctrs 0.3.8 2021-04-29 [1] RSPM (R 4.1.0)
## vroom 1.5.5 2021-09-14 [1] RSPM (R 4.1.0)
## withr 2.4.2 2021-04-18 [1] RSPM (R 4.1.0)
## xfun 0.27 2021-10-18 [1] RSPM (R 4.1.0)
## xml2 1.3.2 2020-04-23 [1] RSPM (R 4.1.0)
## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.1.0)
## xts 0.12.1 2020-09-09 [1] RSPM (R 4.1.0)
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.1.0)
## zoo 1.8-9 2021-03-09 [1] RSPM (R 4.1.0)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library
options(width = 80L)